An Approximation-Based Data Structure for Similarity Search

Authors

  • Roger Weber
  • Stephen Blott
Abstract

Many similarity measures for multimedia retrieval, decision support, and data mining are based on underlying vector spaces of high dimensionality. Data-partitioning index methods for such spaces (e.g. grid-files, quad-trees, R-trees, X-trees, etc.) generally work well for low-dimensional spaces, but perform poorly as dimensionality increases, a phenomenon which has become known as the 'dimensional curse'. In this paper, we first provide an analysis of the nearest-neighbor search problem in high-dimensional vector spaces. Under the assumptions of uniformity and independence, we establish bounds on the average performance of three important classes of data-partitioning techniques. We then introduce the vector-approximation file (VA-File), a method which overcomes the difficulties of high dimensionality by following not the data-partitioning approach of conventional index methods, but rather a filter-based approach. A VA-File contains a compact, geometric approximation for each vector. By first scanning these smaller approximations, only a small fraction of the vectors themselves must be visited. Thus, the VA-File acts as a simple filter, much as a signature file is a filter. Performance is evaluated on the basis of both synthetic and real data sets, and compared to that of the R*-tree and the X-tree. We show that performance does not degrade, and even improves with increased dimensionality. Both our analytical and our experimental results suggest that the VA-File is generally the preferred method for similarity search over moderate and large data sets with dimensionality in excess of around ten.
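The filter idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the uniform grid over [0, 1), the choice of 3 bits per dimension, the Euclidean metric, the restriction to a single nearest neighbor, and the names `approximate`, `lower_bound`, and `nearest` are all assumptions made for illustration. Each vector is reduced to a tuple of grid-slice indices. A query first scans only these approximations to obtain a lower bound on each distance, then visits the full vectors in order of that bound and stops once no remaining cell can hold a closer vector.

```python
import math

BITS = 3                # assumed number of bits per dimension
CELLS = 1 << BITS       # number of grid slices per dimension


def approximate(vec):
    """Map each coordinate in [0, 1] to the index of its grid slice."""
    return tuple(min(int(x * CELLS), CELLS - 1) for x in vec)


def lower_bound(query, approx):
    """Smallest possible Euclidean distance from query to any point of the cell."""
    total = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / CELLS, (c + 1) / CELLS
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
        # if lo <= q <= hi, this dimension contributes nothing to the bound
    return math.sqrt(total)


def nearest(query, vectors, approximations):
    """Return (index, distance, vectors_visited) for the nearest neighbor."""
    # Filter step: look only at the compact approximations.
    bounds = sorted(
        (lower_bound(query, a), i) for i, a in enumerate(approximations)
    )
    best_i, best_d, visited = -1, math.inf, 0
    # Refinement step: visit full vectors in order of increasing lower bound.
    for lb, i in bounds:
        if lb >= best_d:
            break  # every remaining cell is at least as far as the best match
        visited += 1
        d = math.dist(query, vectors[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, visited
```

The third return value counts how many full vectors were read, which is the quantity the filter is meant to keep small. In a real system the approximations would be packed into a bit array on disk rather than held as Python tuples.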




Publication date: 1997